Credit Card Fraud Detection

Machine learning techniques have risen dramatically in popularity in the financial sector over the past two decades, particularly for detecting fraudulent transactions. We seek to accomplish a similar goal here. Our data set of choice contains nearly 285,000 credit card transactions; using multiple unsupervised anomaly detection algorithms, we will identify transactions with a high probability of being credit card fraud. We shall build and make use of two algorithms:

  • Local Outlier Factor (LOF)
  • Isolation Forest Algorithm

The main difficulty in credit card fraud detection is that transaction data suffer from a strong class imbalance: less than 1% of the transactions are fraudulent. This imbalance needs to be accounted for, either from the classifier perspective or from the data perspective.

We shall use precision, recall, and F1-score as our evaluation metrics; these are discussed further later in the tutorial. We also investigate why classification accuracy can be misleading for these algorithms.
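To see why accuracy misleads on imbalanced data, consider a toy illustration (synthetic labels, not our data set) in which a degenerate classifier simply predicts "not fraud" for everything:

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 998 legitimate (0) and 2 fraudulent (1) transactions
y_true = [0] * 998 + [1] * 2
# A degenerate "classifier" that always predicts the majority class
y_pred = [0] * 1000

print(accuracy_score(y_true, y_pred))   # 0.998 -- looks excellent
print(recall_score(y_true, y_pred))     # 0.0 -- yet it catches zero fraud
```

Despite 99.8% accuracy, the model detects none of the fraud, which is exactly why we lean on precision, recall, and F1 instead.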

The plan of development for the project is as follows. We begin by setting up the environment, reading in the data and loading packages. We then proceed to data cleaning. This is followed by some brief data exploration (parameter histograms and correlation matrices) to gain a better understanding of the underlying distribution of the data. We then build our machine learning models. Finally, we assess the models using a variety of metrics.

Machine Learning Algorithms

As mentioned, we shall apply two anomaly detection algorithms: Local Outlier Factor and the Isolation Forest algorithm. Before discussing them, it's important to highlight what constitutes an outlier or anomaly: any data point or observation that deviates significantly from the other observations. Anomaly detection has become very significant in practical settings with real-world data; its techniques find application in domains such as detection of fraudulent bank transactions, network intrusion detection, sudden rises or drops in sales, changes in customer behaviour, and so on.

Local Outlier Factor (LOF)

This is an unsupervised anomaly detection method that produces an anomaly score indicating how much of an outlier each data point is. It computes the local density deviation of a given data point with respect to its neighbours, and flags as outliers the samples whose density is substantially lower than that of their neighbours.

An example of the algorithm is given here, and a good breakdown of the algorithm is available here.
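As a quick illustration of the idea, here is a minimal sketch on synthetic toy data (not our credit card set): a tight cluster plus one planted outlier, which LOF flags because its local density is far lower than its neighbours'.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic 2-D data: a tight cluster plus one planted outlier (illustration only)
rng = np.random.RandomState(0)
X_toy = np.vstack([rng.normal(0, 0.3, size=(20, 2)), [[5.0, 5.0]]])

lof = LocalOutlierFactor(n_neighbors=5)
labels = lof.fit_predict(X_toy)            # -1 = outlier, 1 = inlier
print(labels[-1])                          # the planted point is flagged as -1
print(lof.negative_outlier_factor_[-1])    # far more negative than the inliers' scores
```

The `negative_outlier_factor_` attribute holds the (negated) LOF scores; values close to -1 are normal, while strongly negative values mark low-density outliers.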

Isolation Forest Algorithm

This is an unsupervised anomaly detection technique. Isolation Forests (IF), like Random Forests, are built from decision trees; since no pre-defined labels are used, it is an unsupervised model. Isolation Forests are built on the observation that anomalies are data points that are “few and different”. Randomly sub-sampled data is processed in a tree structure based on randomly selected features. Samples that travel deeper into the tree are less likely to be anomalies, as they required more cuts to isolate; samples that end up in shorter branches indicate anomalies, as it was easier for the tree to separate them from the other observations.

So essentially we begin with the data and create an ensemble of decision trees. During scoring, a data point is traversed through all the trees trained earlier, and an anomaly score is assigned to it based on the depth required to reach that point, aggregated over all the trees. Based on the contamination parameter (the percentage of anomalies present in the data), a label of -1 is assigned to anomalies and 1 to normal points.

The basic idea is that anomalous data points (transactions) are very different from the rest, so they are easily separated (short branches).

More information on the algorithm is found here.
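The short-branch intuition can be sketched on synthetic toy data (again, not our credit card set): a single far-away point is isolated in very few random cuts, so it receives a low decision score and the -1 label.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: a dense cluster plus one far-away point (illustration only)
rng = np.random.RandomState(42)
X_toy = np.vstack([rng.normal(0, 0.5, size=(100, 2)), [[8.0, 8.0]]])

iso = IsolationForest(random_state=42).fit(X_toy)
labels = iso.predict(X_toy)              # -1 = anomaly, 1 = normal
scores = iso.decision_function(X_toy)    # lower scores = more anomalous
print(labels[-1])                        # planted outlier is labelled -1
```

The `decision_function` values are derived from the average path length across the forest, which is exactly the normality measure described above.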


Setup

We begin by importing the necessary libraries. We also print out the version numbers of all the libraries used in this project; this helps ensure we are using the correct libraries and makes the work reproducible.

In [7]:
#Modules
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy

#Versions
print("Versions of modules")
print('Python: {}'.format(sys.version))
print('Numpy: {}'.format(np.__version__))
print('Pandas: {}'.format(pd.__version__))
print('Seaborn: {}'.format(sns.__version__))
print('Scipy: {}'.format(scipy.__version__))

import warnings
warnings.filterwarnings('ignore')
Versions of modules
Python: 3.9.12 (main, Apr  5 2022, 01:53:17) 
[Clang 12.0.0 ]
Numpy: 1.21.5
Pandas: 1.4.2
Seaborn: 0.11.2
Scipy: 1.7.3

Now let's load our data set from a .csv file as a Pandas DataFrame and view the columns in the data frame.

In [10]:
# Load the dataset from the csv file using pandas
data = pd.read_csv('/Users/pavansingh/Library/CloudStorage/GoogleDrive-pavansingho23@gmail.com/My Drive/Portfolio/Projects/Python/Credit Card Fraud /creditcard.csv')
In [11]:
# Show columns or data variables
print(data.columns)
Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Class'],
      dtype='object')

Data Description

The data set is hosted on Kaggle. The actual data has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group of ULB on big data mining and fraud detection.

The dataset contains transactions made by credit cards in September 2013 by European cardholders. It presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions. This is a common problem with fraudulent-transaction data sets; they are extremely imbalanced.

It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, the original features and more background information about the data cannot be provided. Features V1, V2, … V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount, which can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable and takes the value 1 in case of fraud and 0 otherwise.

Given the class imbalance ratio, we recommend measuring performance using the Area Under the Precision-Recall Curve (AUPRC). Confusion-matrix accuracy is not meaningful for unbalanced classification.
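A minimal sketch of how AUPRC could be computed with scikit-learn, using hypothetical labels and scores (in practice the scores would come from a fitted model's decision function):

```python
from sklearn.metrics import average_precision_score

# Hypothetical labels and anomaly scores (higher score = more likely fraud)
y_true   = [0, 0, 0, 0, 1, 0, 1, 0, 0, 1]
y_scores = [0.1, 0.2, 0.1, 0.3, 0.9, 0.2, 0.8, 0.1, 0.4, 0.35]

auprc = average_precision_score(y_true, y_scores)
print(round(auprc, 3))   # 0.917
```

`average_precision_score` summarises the precision-recall curve as a weighted mean of precisions at each threshold, so it rewards models that rank the rare fraud cases near the top.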

We begin by looking at the shape and a summary of the data.

In [12]:
# Print the shape of the data
print("Shape: ", data.shape)
data = data.sample(frac=0.2, random_state = 1) #get a sample of the full data now
Shape:  (284807, 31)
In [13]:
# Summary
print(data.describe().round(1).transpose())
          count     mean      std   min      25%      50%       75%       max
Time    56961.0  94571.2  47566.9   0.0  53809.0  84511.0  139237.0  172784.0
V1      56961.0      0.0      1.9 -46.9     -0.9      0.0       1.3       2.4
V2      56961.0     -0.0      1.7 -63.3     -0.6      0.1       0.8      17.4
V3      56961.0      0.0      1.5 -31.8     -0.9      0.2       1.0       4.1
V4      56961.0      0.0      1.4  -5.3     -0.8     -0.0       0.8      16.7
V5      56961.0     -0.0      1.4 -42.1     -0.7     -0.1       0.6      34.1
V6      56961.0      0.0      1.3 -23.5     -0.8     -0.3       0.4      22.5
V7      56961.0     -0.0      1.2 -26.5     -0.6      0.0       0.6      36.7
V8      56961.0      0.0      1.2 -33.8     -0.2      0.0       0.3      19.6
V9      56961.0      0.0      1.1  -8.7     -0.6     -0.0       0.6      10.4
V10     56961.0      0.0      1.1 -18.3     -0.5     -0.1       0.5      12.9
V11     56961.0     -0.0      1.0  -4.1     -0.8     -0.0       0.7      11.2
V12     56961.0     -0.0      1.0 -18.6     -0.4      0.1       0.6       4.4
V13     56961.0      0.0      1.0  -4.0     -0.6     -0.0       0.7       4.4
V14     56961.0      0.0      0.9 -18.0     -0.4      0.0       0.5       7.4
V15     56961.0      0.0      0.9  -4.2     -0.6      0.1       0.7       5.8
V16     56961.0      0.0      0.9 -12.7     -0.5      0.1       0.5       6.4
V17     56961.0      0.0      0.8 -25.2     -0.5     -0.1       0.4       9.3
V18     56961.0      0.0      0.8  -9.0     -0.5     -0.0       0.5       5.0
V19     56961.0     -0.0      0.8  -7.2     -0.5     -0.0       0.5       5.2
V20     56961.0      0.0      0.8 -23.4     -0.2     -0.1       0.1      39.4
V21     56961.0      0.0      0.7 -16.6     -0.2     -0.0       0.2      22.6
V22     56961.0      0.0      0.7 -10.9     -0.5      0.0       0.5       6.1
V23     56961.0     -0.0      0.7 -36.7     -0.2     -0.0       0.1      18.9
V24     56961.0     -0.0      0.6  -2.8     -0.4      0.0       0.4       4.0
V25     56961.0      0.0      0.5  -7.0     -0.3      0.0       0.4       5.5
V26     56961.0      0.0      0.5  -2.5     -0.3     -0.0       0.2       3.2
V27     56961.0      0.0      0.4  -8.3     -0.1      0.0       0.1      11.1
V28     56961.0     -0.0      0.3  -9.6     -0.1      0.0       0.1      15.4
Amount  56961.0     88.8    254.7   0.0      6.0     22.2      77.9   19656.5
Class   56961.0      0.0      0.0   0.0      0.0      0.0       0.0       1.0

V1 - V28 are the results of a PCA Dimensionality reduction to protect user identities and sensitive features.


Data Exploration

In [14]:
# Plot histograms of each parameter 
data.hist(figsize = (20, 20))
plt.show()

We see several variables exhibit close-to-normal distributions, namely V13, V15, V24 and V26. The Time variable appears to be bi-modal, with two peaks.

Let's see how many fraudulent cases we have in our data set. We print these out below.

In [15]:
# Determine number of fraud cases in dataset
Fraud = data[data['Class'] == 1]
Valid = data[data['Class'] == 0]

# Outliers
outlier_fraction = len(Fraud)/float(len(Valid))
print("Outlier Fraction: ", outlier_fraction)

print('Fraud Cases: {}'.format(len(data[data['Class'] == 1])))
print('Valid Transactions: {}'.format(len(data[data['Class'] == 0])))
Outlier Fraction:  0.0015296972254457222
Fraud Cases: 87
Valid Transactions: 56874
In [16]:
# Correlation matrix
corrmat = data.corr()
fig = plt.figure(figsize = (12, 9))

sns.heatmap(corrmat, vmax = .8, square = True)
plt.show()

The majority of variables appear to have little to no correlation (very weak correlation) with each other. Notable correlations that do appear are:

  • V20 and Amount: strong positive correlation
  • V7 and Amount: strong positive correlation
  • V14, V10, V17 and Class: each has a relatively weak but present correlation

It would be better to view the actual numbers. Below is the correlation heat map annotated with them.

In [17]:
sns.set(rc = {'figure.figsize':(30,20)})                                 # used to control the theme and configurations of the seaborn plot
sns.heatmap(data.corr(), annot=True, cmap = "summer")            # heatmap with seaborn
Out[17]:
<AxesSubplot:>

We can identify the correlation using this graphic.
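Rather than reading values off the heat map, we could also rank features by their correlation with the response programmatically. Here is a sketch on a small synthetic stand-in data frame (fabricated columns and response, for illustration only); with the actual data, `df` would simply be `data`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data frame (fewer columns, fabricated response)
rng = np.random.RandomState(1)
df = pd.DataFrame({'V1': rng.normal(size=200),
                   'V2': rng.normal(size=200),
                   'Amount': rng.exponential(50, size=200)})
df['Class'] = (df['V1'] > df['V1'].median()).astype(int)  # fabricated, for illustration

# Rank features by absolute correlation with the response
corr_with_class = df.corr()['Class'].drop('Class').abs().sort_values(ascending=False)
print(corr_with_class)
```

Applied to the real data, this would surface V14, V10 and V17 (and the Amount correlations) without squinting at the heat map.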

Let's proceed and derive our X and Y data. X holds the predictors on which we shall train our models; Y is the response we shall be predicting.

In [56]:
# Get all the columns from the dataFrame
columns = data.columns.tolist()

# Filter the columns to remove data we do not want
columns = [c for c in columns if c not in ["Class"]]

# Store the variable we'll be predicting on
target = "Class"

X = data[columns]
Y = data[target]

# Print shapes
print("X data shape: ",X.shape)
print("Y data shape: ", Y.shape)
X data shape:  (56961, 30)
Y data shape:  (56961,)

Unsupervised Outlier Detection

Now that we have processed our data and briefly analysed it, we can begin deploying our machine learning algorithms. As discussed, we will use the following techniques:

  1. Local Outlier Factor (LOF)
  • The anomaly score of each sample is called Local Outlier Factor. It measures the local deviation of density of a given sample with respect to its neighbors. It is local in that the anomaly score depends on how isolated the object is with respect to the surrounding neighborhood.
  2. Isolation Forest Algorithm
  • The IsolationForest ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. Since recursive partitioning can be represented by a tree structure, the number of splittings required to isolate a sample is equivalent to the path length from the root node to the terminating node. This path length, averaged over a forest of such random trees, is a measure of normality and our decision function.

  • Random partitioning produces noticeably shorter paths for anomalies. Hence, when a forest of random trees collectively produce shorter path lengths for particular samples, they are highly likely to be anomalies.

Let's load the modules we need and define a seed state.

In [59]:
# Modules Needed for Anomaly Detection
from sklearn.metrics import classification_report, accuracy_score
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# define random states
state = 1

Let's set up the algorithms and the parameters with which we shall model.

In [60]:
# define outlier detection tools to be compared
classifiers = {
    "Isolation Forest": IsolationForest(max_samples=len(X),
                                        contamination=outlier_fraction,
                                        random_state=state),
    "Local Outlier Factor": LocalOutlierFactor(
        n_neighbors=20,
        contamination=outlier_fraction)}

Now let's fit the models to the data.

In [66]:
# Fit the models and tag outliers
n_outliers = len(Fraud)

for i, (clf_name, clf) in enumerate(classifiers.items()):
    
    # fit the data and tag outliers
    if clf_name == "Local Outlier Factor":
        y_pred = clf.fit_predict(X)
        scores_pred = clf.negative_outlier_factor_
    else:
        clf.fit(X)
        scores_pred = clf.decision_function(X)
        y_pred = clf.predict(X)

    # Reshape the prediction values to 0 for valid, 1 for fraud.
    y_pred[y_pred == 1] = 0
    y_pred[y_pred == -1] = 1

    n_errors = (y_pred != Y).sum()

    # Run classification metrics
    print('{}: {}'.format(clf_name, n_errors))
    print(accuracy_score(Y, y_pred))
    print(classification_report(Y, y_pred))
/Users/pavansingh/opt/anaconda3/lib/python3.9/site-packages/sklearn/base.py:450: UserWarning: X does not have valid feature names, but IsolationForest was fitted with feature names
  warnings.warn(
Isolation Forest: 127
0.997770404311722
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56874
           1       0.27      0.28      0.27        87

    accuracy                           1.00     56961
   macro avg       0.64      0.64      0.64     56961
weighted avg       1.00      1.00      1.00     56961

Local Outlier Factor: 173
0.9969628342199048
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56874
           1       0.01      0.01      0.01        87

    accuracy                           1.00     56961
   macro avg       0.50      0.50      0.50     56961
weighted avg       1.00      1.00      1.00     56961


After running our algorithms, we see a substantial improvement in results using the Isolation Forest algorithm over Local Outlier Factor: it makes fewer misclassifications (127 vs 173) and achieves far better precision and recall on the fraud class (both around 0.27-0.28, versus roughly 0.01 for LOF). Note that both models report near-perfect accuracy, which, as discussed, tells us almost nothing on such an imbalanced data set.
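The per-class metrics in the reports above can be reproduced from first principles. A small worked sketch, using hypothetical round counts for the fraud class (not the exact confusion matrix from the run):

```python
# Precision = TP / (TP + FP); Recall = TP / (TP + FN); F1 = their harmonic mean.
# Hypothetical counts for the fraud class (round numbers, for illustration only):
TP, FP, FN = 20, 60, 60

precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)   # 0.25 0.25 0.25
```

Precision answers "of the transactions we flagged, how many were really fraud?", while recall answers "of the real frauds, how many did we flag?"; F1 balances the two, which is why it is a far more honest summary here than accuracy.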